Developing Corpora for Statistical Graphical Language Models

Authors

  • Andrew O'Sullivan
  • Laura Keyes
  • Adam C. Winstanley
Abstract

In this work, Statistical Graphical Language Models (SGLMs), a technique adapted from Statistical Language Models (SLMs), are applied to the task of graphical object recognition. SLMs are used in Natural Language Processing for tasks such as speech recognition and information retrieval. SGLMs treat graphical objects as belonging to graphical languages and use this view to compute probability distributions of graphical objects within graphical documents. SGLMs such as N-grams require large corpora of training data, consisting of graphical objects in contextual use (real-world graphical documents). Constructing these corpora is an important stage in developing the models, and many issues need to be addressed. This paper discusses the development of graphical corpora and presents approaches to some of the problems encountered.
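As a rough illustration of the N-gram idea mentioned in the abstract, the sketch below estimates bigram probabilities over sequences of graphical object labels. It is a minimal sketch under stated assumptions, not the authors' method: the labels ("road", "building", "tree"), the way each document is flattened into a label sequence, the helper names, and the add-one smoothing are all hypothetical choices made for this example.

```python
from collections import Counter, defaultdict

START = "<s>"  # hypothetical start-of-document marker


def train_bigram(sequences):
    """Count contexts and bigrams over sequences of object labels."""
    context_counts = Counter()
    bigram_counts = defaultdict(Counter)
    vocab = set()
    for seq in sequences:
        vocab.update(seq)
        padded = [START] + list(seq)
        for prev, cur in zip(padded, padded[1:]):
            context_counts[prev] += 1
            bigram_counts[prev][cur] += 1
    return context_counts, bigram_counts, vocab


def bigram_prob(prev, cur, context_counts, bigram_counts, vocab):
    """Add-one (Laplace) smoothed estimate of P(cur | prev)."""
    seen = bigram_counts.get(prev, Counter())[cur]
    return (seen + 1) / (context_counts[prev] + len(vocab))


# Toy corpus (invented data): each inner list stands for one graphical
# document, reduced to the sequence of object labels found in it.
corpus = [
    ["road", "building", "building", "tree"],
    ["road", "tree", "building"],
    ["building", "building", "road"],
]

context_counts, bigram_counts, vocab = train_bigram(corpus)
p = bigram_prob("building", "building", context_counts, bigram_counts, vocab)
print(f"P(building | building) = {p:.3f}")
```

Even in this toy setting, the estimates depend entirely on how many labelled documents are available, which is why corpus construction matters for models of this kind.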

Similar resources

A new model for persian multi-part words edition based on statistical machine translation

Multi-part words in English language are hyphenated and hyphen is used to separate different parts. Persian language consists of multi-part words as well. Based on Persian morphology, half-space character is needed to separate parts of multi-part words where in many cases people incorrectly use space character instead of half-space character. This common incorrectly use of space leads to some s...

Trameur: A Framework for Annotated Text Corpora Exploration

Corpus resources with complex linguistic annotations are becoming increasingly important in the work of language specialists. They often need to perform extensive corpus research, including Natural Language Processing (NLP), statistical modelling and data visualisation. Our software system, called Trameur, aims at making these analyses possible within a single graphical user interface. It relie...

A Computational Platform for Development of Morphologic and Phonetic Lexica

Statistic approaches in speech technology, either based on statistical language models, trees, hidden Markov models or neural networks, represent the driving forces for the creation of language resources (LR), e.g. text corpora, pronunciation lexica and speech databases. This paper presents the system architecture for rapid construction of morphologic and phonetic lexica for Slovenian language....

Polylingual Topic Models

Statistical topic models are a useful tool for analyzing large, unstructured document collections [1, 2]. Such collections are increasingly available in multiple languages. Previous work on bilingual topic modeling [4] has focused on aligning pairs of translated sentences. In contrast, we consider “loosely parallel” corpora, in which tuples of documents in different languages are not direct tra...

Language Models as Representations for Weakly Supervised NLP Tasks

Finding the right representations for words is critical for building accurate NLP systems when domain-specific labeled data for the task is scarce. This paper investigates language model representations, in which language models trained on unlabeled corpora are used to generate real-valued feature vectors for words. We investigate ngram models and probabilistic graphical models, including a nov...

Publication date: 2006